September 18, 2016
Data Import
Tidying Data
Manipulating Data
The tidyverse includes a number of easy-to-use packages for importing data:
readr for text files (including .csv and .tsv files)readxl for Excel fileshaven for Stata, SPSS, and SAS filesload() loads .RData/.rda filesThe tidyverse includes a number of easy-to-use packages for importing data:
readr for text files (including .csv and .tsv files)
readxl for Excel files
haven for Stata, SPSS, and SAS files
and of course base-R load() loads .RData/.rda files
rio package provides a one-size-fits-all shortcut: import()Getting data into R requires us to have data.
For our own survey or experiment, we can just publish the data
For others’ data, though, we should provide replicable access
If there’s a static link directly to the file you need, such as this income inequality data from the Australian Bureau of Statistics, automating the download is straightforward
abs_link <- "http://www.abs.gov.au/AUSSTATS/subscriber.nsf/
log?openagent&6523DO00001_201314.xls&6523.0&Data%20Cubes&
4F00682720AFA825CA257EB5001B77B9&0&2013-14&16.12.2015&Latest"
download.file(url = abs_link, destfile = "cmcr04_files/abs.xls")
abs_link <- "http://www.abs.gov.au/AUSSTATS/subscriber.nsf/ log?openagent&6523DO00001_201314.xls&6523.0&Data%20Cubes& 4F00682720AFA825CA257EB5001B77B9&0&2013-14&16.12.2015&Latest" download.file(url = abs_link, destfile = "cmcr04_files/abs.xls")
It can be a good idea to make downloads conditional:
if (!file.exists("cmcr04_files/abs.xls")) {
abs_link <- "http://www.abs.gov.au/AUSSTATS/subscriber.nsf/
log?openagent&6523DO00001_201314.xls&6523.0&Data%20Cubes&
4F00682720AFA825CA257EB5001B77B9&0&2013-14&16.12.2015&Latest"
download.file(url = abs_link, destfile = "cmcr04_files/abs.xls")
}
download.file() won’t workgesisgesisgesisgesisgesisgesislibrary(gesis)
fs <- login(username = "frederick-solt@uiowa.edu",
password = "not_my_real_password!")
download_dataset(s = fs,
doi = "6643",
path = "cmcr04_files",
purpose = 1) # "1. for scientific research"
## Downloading DOI: 6643
list.files(path = "cmcr04_files", pattern="ZA.*")
## [1] "ZA6643_v2-0-0.dta"
gesisNote that you should actually save your username and password in your .Renviron as “GESIS_USER” and “GESIS_PASS” to keep your information private.
gesisNote that you should actually save your username and password in your .Rprofile as “gesis_user” and “gesis_pass” to keep your information private.
library(gesis)
fs <- login(getOption("gesis_user"), getOption("gesis_pass"))
download_dataset(s = fs,
doi = "6643",
path = "cmcr04_files",
purpose = 1) # "1. for scientific research"
## Downloading DOI: 6643
CANSIM2R: Directly Extracts Complete CANSIM Data Tables. “Extract CANSIM (Statistics Canada) tables and transform them into readily usable data in panel (wide) format.” Use CANSIM2R:::downloadCANSIM() to just get the data.pxweb: R Interface to the PX-Web/PC-Axis API. “The PX-Web/PC-Axis API is used by organizations such as Statistics Sweden and Statistics Finland to disseminate data.”filter()arrange()select()mutate() / transmute()group_by() + summarize()filter()
arrange()
select()
mutate() / transmute()
group_by() + summarize()
Remember that Australian Excel file we grabbed with the static link?
It’s got a problem:
It isn’t tidy.
This means that it won’t play well with others.
Remember that Australian Excel file we grabbed with the static link?
It won’t play well with others, so we’ll have to tidy it before use.
library(rio)
ineq <- import("cmcr04_files/abs.xls", sheet = "Table 1.1", skip = 4)
library(rio)
ineq <- import("cmcr04_files/abs.xls", sheet = "Table 1.1", skip = 4)
View(ineq)
First step: give names to those first two columns!
names(ineq)
## [1] "..1" "..2" "1994–95" "1995–96" ## [5] "1996–97" "1997–98" "1999–2000" "2000–01" ## [9] "2002–03" "2003–04(a)" "2005–06(a)" "2007–08(a)" ## [13] "2009–10(a)" "2011–12(a)" "2013–14(a)"
names(ineq)[1:2] <- c("var", "unit")
names(ineq)
## [1] "var" "unit" "1994–95" "1995–96" ## [5] "1996–97" "1997–98" "1999–2000" "2000–01" ## [9] "2002–03" "2003–04(a)" "2005–06(a)" "2007–08(a)" ## [13] "2009–10(a)" "2011–12(a)" "2013–14(a)"
Second step: gather the (Gini) data
library(tidyr)
gathered <- ineq %>%
filter(var=="Gini coefficient") %>%
mutate(var = ifelse(unit=="RSE(%)", "gini_rse", "gini")) %>%
select(-unit) %>%
gather(key = year, value = val, -var)
Second step: gather the (Gini) data
Third step: spread the data again
spreaded <- gathered %>%
spread(key = var, value = val)
Third step: spread the data again
It’s tidy now!
It’s tidy, but its still not clean. Fourth step: clean up
library(stringr) # more on stringr next week!
ineq_tidy <- spreaded %>%
transmute(country = "Australia",
year = str_replace(year, "\\(a\\)", ""),
year = ifelse(str_extract(year, "\\d{2}$") %>%
as.numeric() > 50,
str_extract(year, "\\d{2}$") %>%
as.numeric() + 1900,
str_extract(year, "\\d{2}$") %>%
as.numeric() + 2000),
gini = as.numeric(gini),
gini_se = as.numeric(gini_rse)/100 * gini)
It’s tidy, but its still not clean. Fourth step: clean up